Language/OS - Multiplatform Resource Library

home *** CD-ROM | disk | FTP | other *** search

/ Language/OS - Multiplatform Resource Library / LANGUAGE OS.iso / scheme / scm / wb1a1.lha / wb / Design.doc < prev next >

Wrap

Text File | 1993-06-25 | 32.9 KB | 775 lines

"Design.doc" for WB-tree File Based Associative String Data Base. Copyright (C) 1991, 1992, 1993 Holland Mark Martin This is an attempt to make explicit the major decisions we've either made explicitly or arrived at though experience in the course of working on the wb-tree database. Since there are actually a lot of them, I've divided this into the following areas: I. B-Tree Structure and Access II. Concurrency III. Buffer, I/O, and Free-List Management IV. Error Handling V. Misc last mod: oct 28/1992 rjz reorganized and expanded nov 19/1992 rjz recorded recent ananlysis of UPDATE screw and deferred updates dec 2-3/1992 rjz random elaboratins and edits dec 8-9/1992 rjz RANGE doc and extensions dec 16 rjz misc editing jan 8/1993 rjz doc new deferred-tree-update algorithms jan 13-15 rjz more of same; add descr of concurrency problems w/free list, error codes, misc, deferred data update proposal. --------------------------------------------------------------- I. B-Tree Structure and Access We used SAGIV 86 as our starting point for concurrent b-tree algorithms. However, we made significant changes and (we feel) improvements. We have particularly tried to simplify the algorithms to reduce the implementation and especiaslly debugging effort. There are also a lot of complex details involved in building a complete implementation. The goal of this document is to describe and explain the major design decisions and any subtleties that arose. 1. Tree format 1.1 The WB-tree (the "wanna-be a tree") We use the B-link tree format (see SAGIV 86). Each leaf block contains some number of (key,value) pairs; each non-leaf (ie, index) block contains (key, down-pointer) pairs. Each block contains one additional key value, the SPLIT key, that is strictly greater than any other key in the block. The tree is augmented by chaining together the nodes of each level of the tree in split-key order (the NEXT pointers). We allow the tree to be missing any index-insertion or aock-deletion updates so long as certain key-order invariants are maintained: A. Blocks are chained in order of increasing split key at each level, and all valid blocks appear in the chain. B. Split keys are unique at each level. C. Every DOWN pointer from level N+1 points to a block B at the next lower level N whose split-key is less than or equal to the key associated with the pointer. C1. If it is less, then it must be the case that (a) a block B' with that split key exists at level N; (b) it is reachable from B by following NEXT pointers; and (c) no pointers to either B' nor any blocks between B and B' exist in level N+1. We call the subchain from B through B' the SPAN of the (key,ptr) entry (split(B'),B). C2. It is invalid for a down pointer to contain a key not present as a split key at the next level. Note that a key in an index must match its block's split key exactly; if a key K is less than the split-key S of the block B it points to, searches for intervening keys will be misdirected (to the next block); and if K is greater than S, then splits of the block after B at key values K' where S<K'<K will be mis-inserted in B's paren, because K' should logically go AFTER K, but K'<K. The notion of the span of an index entry is useful. We note that each block split can be thought of as an EXTENSION of some span at the next-higher level, while each parent-insert-update can be thought of as a corresponding span REDUCTION. A span that has only one block in it will be called FULLY REDUCED; a b-tree is fully reduced when all its spans are fully reduced, meaning that all pending/deferreed insert-updates have been performed. Lastly, we can express rule C1 more succinctly in terms of spans: C,C2. SPANs must be well-formed (span must be closed; keys must match exactly) C1. The SPANS of the entries at any index must not overlap. 1.2 Key/value order; uniform leaf/node format We originally had index nodes in (value,key) order because it made insertions simple, but abandoned that because it made all the insertion/deletion propagation code special-case. In fact, because of the asymmetry of INSERT and DELETE, either one or the other will require that a single operation sometimes modify two blocks (see INSERTION METHOD). However, using (key,value) ordering at all levels of the tree simplified the code considerably. 2. Split keys Split keys seem to be an accepted method for B-tree organization. In a Blink-tree, wherer one might potentailly have to chain forward to find a desired key, the split key allows one to determine when the search can be terminated. Having the split key at the END of the block saves one block access per search. (If the split key were at the front of the block, one would have to access the next block in the chain to determine that a search could terminate.) 2.1 Last-key values It is useful to define a "largest key" to be the split key of the last at any given level; the code is made more complex if the last key is some special value (eg, the null string) that doesnt follow lexicographic order, not to mention the problems that arise with inserting into empty (last) blocks. We've reserved the set of strings starting with #\255 as split keys only; they cannot be used as real key values. Last-key values are of the form xFF<level>. This makes the last key in level N+1 strictly larger than any in level N, so the end-of-chain key for level N need not be treated specially when it is inserted as a key at level N+1. 2.2 Fixed split keys/Independence of levels We make the split keys at index levels act like those at the leaf level, that is, the split key in a block NEVER CHANGES. The advantage is that inserts and deletes cannot cause split-key changes, avoiding the need for code to propagate these changes, which are a pain to test. It also makes the levels genuinely independent, a useful conceptual simplification. The disadvantage is that this introduces a "dead zone" between the last key-value pair of an index block and its split key. This complicates the code slightly, and makes the average search path slightly longer than logN (by a constant factor of 1+(1/BF), where BF is the branching factor of the tree). More annoyingly, it interacts with insertions for block splits (see INSERTION). 3. Insertion method Insertion in a index needs to appear atomic so that the key-order invariant is maintained. Since we are using key/value-ordered index blocks, parent-update insertion has a special extra step: we insert the new key and pointer in index block B, then swap value fields with the next entry in sequence. Unfortunately, when the insertion is the end of the block, the next entry is in the next block B', so two blocks must be modified. (The advantage of value/key ordering is that insertion never modifies more than one block; however, you get screwed in the code for deletions instead.) The code checks for this case and locks both blocks before proceeding. (An alternate solution considered is to back up the split key of B so that the new key would go at the front of B', but that introduces other problems.) While the above method preserves correctness (even under concurrency), there is unfortunately a 2nd-order screw WHICH STILL NEEDS TO BE IMPLEMENTED: how to write the 2 blocks in an order that preserves correctness in case of disk crash. If only B is written, the database will contain 2 pointers to the block whose split caused the insertion; if only B' is written, the pointer correctness will be violated. (It would seem that this kind of problem is inherent to any operation that needs to maintain consistent changes across multiple blocks.) The fix seems to be to surround the write with a notation in the root that says this special case is being exercised and indicates what blocks were being modified (B,B') and what the pointers should be: Write (B,B',down-ptrs) to Blk 0 Write B' (Write new block if B was split by insert) Write B Remove flags from Blk 0 and write. We elect to write out block B' first for consistency with the case where B' is a new block created by a block split. The screw case will create a damaged DB, but the danger of deleting a doubly-pointed-at block is avoided. This is adequate information to recover directly. (JONATHAN?) (Other alternatives were considered. Backing up the split key at next level allows a subsequent insertion to be atomic, but the backing-up operation has the same two-block consistency problem in keeping ITS parent index consistent. If the parent key isnt backed up, subsequent splits of the following block will result in incorrect updates at the next level.) 4. Deletion Deleting can be divided into two parts: the block to be deleted must be unlinked from both parent and predecessor, after which the block can be reclaimed. 4.1 Unlinking the block At the moment, we've decided to simplify the delete operation by requiring that both the parent and the previous block be available for modification, else the deletion fails (is deferred). This makes block deletion an atomic change, which avoids several problems introduced by permitting partial deletions (see CONCURRENCY). Alternatives: WANG 90 gives a clever way to do deletion w/o PREV, by swapping blocks. This seemed too hard to make bulletproof for concurrency so we stuck with the PREV method, which after all only does extra block accesses 1/BF of the time. Crashes during DELETE: we still be to determine an order for the writing of the parent and predecessor blocks in a delete. We elect to write out the PARENT block first, so that if the system should crash the chain will still be intact. The block that we were trying to delete will still be in the chain, but it cannot be used, it is "dead." [Section II.3 discusses how to dead with "dead" blocks.] 4.2 Reclaiming the block One major change from SAGIV is that SAGIV assumed multiple copies of a block could exist, which makes reclaiming deleted blocks complex (he suggests a timeout-based protocol). We opted instead to track how many pointers to each block are extant by adding a lock for that purpose, the NAME lock (essentially, a DELETE lock). A routine must get NAME lock on a pointer BEFORE releasing READ/WRITE on the block from which the pointer is obtained. (SAGIV's method is almost certainly more efficient in the sense that our method incurs overhead on every operation as opposed to just on the problem case of a READ request during a WRITE; several empirical studies of such tradeoffs support this conclusion.) On the other hand, NAME lock is useful for other things, such as insuring that the block you are PREVing from cant be deleted while you're looking for its parent or predecessor... 4.3 Non-symmetry of INSERT and DELETE It is worth remembering that INSERT and DELETE are not symmetric, in the sense that a postponed insertion is NOT equivalnt to deleting a (KEY,PTR) pair. The latter operation leaves a block whose pointer is missing unaccessible via the index, while the former leaves the block accessible through the NEXT pointer chain. This asymmetry has been the death of more proposals for fixing various problems involving concurrency than I care to recall! EXAMPLE: 4.3.1. Start with blk p with split key k at level n; the index (levl n+1) contains: ...k,p,.. 4.3.2. split p at k', adding block p'; the index shoudl now contain ...k',p,k,p'.. 4.3.3. Now neither deleting p nor p' yields the original index contents: If we delete p, the result is ...k,p',... If we delete p', the result is ...k',p,... Therefore, it is not possible for a subsequent delete to "cancel" an insert; the deletes must either wait for the relevant inserts to complete or else do special work to maintain correctness. 5. Non-delete of last block in the chain We currently do not support deletion of the last block at any level, that is, the one with the end-of-block key. This is because this deletion requires special case to retain the end-of-block key. One way to achieve this is to copy forward the contents of the next-to-last block, and deleting that instead. p[There are details to be worked out, eg, preservation of correctness during this operation.] 6. Prev [DESCRIBE HOW IT WORKS] SCREW CASE: FIX UNDERSTOOD BUT NOT YET IMPLEMENTED: if down-pointers are missing to blocks immediately after the first block of a level, PREV will miss those blocks. The problem occurs when a PREV-BLOCK of block B at level N occurs, causing a PREV at level N+1. If down-pointers are missing, the block B' associated with B's split key may be some predecessor of B rather than B itself. In this case PREVing at level N+1 is wasteful but correct; it would require fewer block accesses to chain from B' than from PREV of that entry. However, if the B' entry is the FIRST at level N+1, PREV will erroneously conclude that it is at the start of the chain. It either has to (a) always look at B's entry at level N+1 to check that the current block's ptr is really up-to-date, or just check when it hits the START-OF-CHAIN. 7. Root Block protocol 7.1 Root uniqueness We guarantee that the root block number never changes and thus can be used as a unique identifier for a given WB-tree. Other systems provide a unique tree ID by introducing a level of indirection in root references; this is inefficient, as root references are frequent. When the root is split, we allocate a new block to hold the data that would have remained in the root block, then use the old root block for the new root. This does mean that one cannot depend on a block's ID being unchanged if it splits! 7.2 NOT IMPLEMENTED: ROOT DELETE and Reducing number of levels in a tree. 8. Other tree organizations [There are other possibilities for tree layouts. It is possible that some of these may simplify operations that in the current layout are complex, such as the insert-screw-case. Other possible choices include: - split key at start rather than the end of the block; - reversing the order of (key,value) pairs in the blocks; - not using the split key - using TWO split keys (one at each end of the block); - runnning the chains backearfd rather than forward; My (RJZ's) favorite alternate assumption is allowing multiple versions of blocks to exist, as in SAGIV 86. This would mean: - processes would be allowed to maintain STATE in a block, such as current position, such as for successive NEXTSs; on the other hand, NEXT would have to check for NEXT(X)<X and skip forward accordingly. - W/R interlock overhead should be greatly reduced (as WRITEs do not have to wait for reads); - the need for NAME locking is greatly reduced, since we no longer would be trying to count the of processes that "know" about each block; - reclaiming deleted blocks is harder; - other issues to be evaluated: effect on rebalanace, PREV, and the deferred-operation rpotocol we've developed. Perhaps we can explore the interrelationships in some methodical way someday...] II. Concurrency 1. Name access (and deleted-block reclamation) In order to be able to explicitly know when a block is safe to delete, we insist that a user must get NAME lock on a pointer BEFORE releasing READ/WRITE on the block from which the pointer is obtained. NAME lock is useful for other things, such as - insuring that the block you are PREVing from cant be deleted while you're looking elsewhere; - CHAIN-PUT uses it to insure that the block just split doesnt disappear during the call to PARENT-INSERT-UPDATE - others? 2. Fail-out protocol/access conflict strategy Blocked operations are at the moment simply going to fail with an error code of RETRYERR, meaning that they can be safely retried later. THe current idea is to use this whenever a READ-WRITE conflict occurs. (This would not be necessary using SAGIV's method.) However, since various other lockout and wait conditions can occur -- waits for block reads, waits on NAME locks, waits on interlocked operations -- some such facility would be needed anyway, so it seems reasonable to try to use it to handle READ-WRITE blocking as well. NOT REALLY IMPLEMENTED YET due to the complexity of reorganizing the code to pass up the appropriate information in all cases. (Top-level routines return these codes but internal routines dont really use it yet, so retryability isnt really there yet.) 3. Deferred Index Updates and concurrency The basic strategy is to allow a key insert/delete that causes a block to split/become empty to complete, but to then queue up the parent-update/block delete on some process that will retry deferred updates in the background until they succeed. The problem is that deletes contain two separate operations that can wait: the parent update and the predecessor update. The two updates can be done separately if the block is first marked "dead" so that no insertions into it can occur. Unfortunately, there is a class of rare and complex screw cases involving the correct ordering of deletes and inserts that happen to involve the same key. One might for example delete a block, deleting its key, and then subsequent splits can reintroduce and re-delete it. If operations at the index level are deferred, the ordering of these deferred operations determines whether the resulting tree is correct. Making block-delete atomic (as defined above) greatly simplifies this process. The block being inserted/deleted can then serve as its own semaphore. We have decided to adopt a lazy update strategy. That is, rather than keeping around queues of pending parent-updates, we just throw them away if they can't be executed immmediately. We can get away with this because the updates for level N+1 can be reconstructed from the chain at level N. Now, a deferred update only affects performance if the affected path is actually encountered: if we find we have to chain across two blocks, it means a parent update hasnt occurred yet; if we encounter an empty block, it means a delete update hasn't occurred. Our idea is to fix these "inefficiencies" on demand, that is, only when we run into one will we expend the efort to fix it. For the moment, assume ALL block deletes and insert updates are deferred. The basic algorithm is: (a) If we have to chain forward in searching for a key, there must be pointer missing at the next level (deferred insert). So we attempt to insert it. Forward chaining to reach the NEXT and PREV of a key is normal and hence shouldn't cause update attempts. (b) Whenever we reach an empty block, we attempt to delete it, except for the case where we are doing an insert which should go in that block. The delete attempt will fail UNLESS the relevant parent updates are complete, that is, if and only if the block is within a fully reduced span. One major concern has been that an I/O failure during a block delete can leave a "dead" block, that is, one which can't be reached. The problem is that once a block becomes dead we need to prevent it from being inserted into or restored into use, because that could result in entries being out-of-order (suppose that, while the block's dead, some inserts of keys less than its split key occur. They'll be directed into the next block by the index, and be posted there. If we then restore the block, the key ordering will be incorrect.) But it turns out that we can prevent this by observing which B-tree operations can encounter which types of deferred-update situation, as follows: If we think of the tree as a set of nested SPANS, where the SPAN of an index entry is the set of entries in the blocks it SPANS, we note that during operations using search only - GET,PUT,REM -- the locus of search stays strictly within the SPAN of some entry in the root. We enforce that a block not be deleted until its parent pointers are completely updates, that is, its pointer has a span of exaclty one block. Now suppose a block delete fails halfway, leaving a empty block without any parent pointer. Such a block is unreachable by FIND-NODE and hence by INSERT! This means that if INSERT encounters an empty block, it must be valid to insert into it. This has several interesting consequences: (a) This means that only the operations that can possibly chain ACROSS spans have to worry about dead blocks: the NEXT and PREV operations. (What exactly is the effect on PREV?) (b) This means that "dead" blocks can't be deleted (unless we look for them specially). But they (a) are only created by a rare kind of event, and (b) won't hurt anything, so long as we arrange that NEXT and PREVs ignore the empty blocks (ie, ignore the value of their split keys). (c) The way we have defined deferred-delete reconstruction, if the block isn't within a fully-reduced span, the delete simply can't succeed. This means that there's no point in trying to delete a blank block when we're chaining though a span, that is, we should only try to delete empty blocks encountered (1) when following a DOWN pointer in FIND-ENT, or (2) due to a NEXT operation (or PREV?). [Having realized this, we can actually let block-deletes be two-part operations (again). The only additional complexity is that then we'd need to implement a method for detecting dead blocks, that is, differentiating them from empty blocks in the middle of large spans, which we mustn't try to delete. We just check if the block is contined in some span! HOW TO DO THAT EFFICIENTLY?] In detail, the method of detecting deferred operations goes like this: 1. RECONSTRUCTING DEFERRED PARENT-UPDATES (missing DOWN pointers): Whenever we have to follow the NEXT pointer of a block B (in FIND-NODE), we should attempt a PARENT-INSERT-UPDATE using the (key,value) pair (split(B),next(B)). FIND-NODE needs to be fixed to chain right rather than down when the "dead zone" is encountered; and it should not attempt a PARENT-INSERT-UPDATE in this case. PARENT-INSERT-UPDATE can fail if: (a) some other process is already doing the update (the first one to lock the parent wins, the other will fail out); (b) the key split(B) is already present (means a DELETE is pending -- this shouldn't really occur, though); (c) the update requires a block split but no blocks are available. 2. RECONSTRUCTING DEFERRED BLOCK DELETES: Whenever we reach an empty block in FIND-NODE or a NEXT/PREV operation, we'll attempt a PARENT-DELETE-UPDATE. PARENT-DELETE-UPDATE should fail when (a) the key is found but points to a different block (meaning the containing span isn't fully reduced); (b) the block is NAME-locked (this means its safe to leave name-locks around, the worst that can happen is they cause block deletes to be deferred some); (c) someone else is doing a PARENT-DELETE-UPDATE on this block (this can be guaranteed by having the delete process first name-lock the block being deleted; (d) some block needing to be modified (parent or prev) is locked. 3. We need to fix the code that does the NEXT-KEY operation to not stop at potentially-dead blocks. CHAIN-FIND need NOT ignore blank blocks. In practice, we'll want to trigger the update routines at the normal times as well, ie, try insert-updates after block splits and delete-updates after a block becomes empty. There are a number of new statistics we should keep; these include: * deferred parent-insert-updates (PUDs) * insert-updates that succeed * insert-updates that fail (measures overhead of the method) * insert-update failure rate * deferred block deletes * deferred block deletes that succeed * deferred block deletes that fail (measures overhead of the method) * delete failure rate * count of dead blocks found (requires extra work) (The number of chain-forwards is also a meaure of the overhead.) [Note: this mechanism aldo supports a simple queuing method: keep a list of the block numbers at which updates were deferred, and in times of low usage do FINDs to them, which will update them as a side-effect...] ------------ [Other strategies were considered but were either significantly more complex or overheady, or introduced unnecessary delays: One hack is to serialize the queue of potponed index INSERT and DELETE operations by brute force: to do them in exactly the order they arrived in. Possibly simpler alternative method: sort by key and timestamp them! I think we also decided that the interpreter could simply devote a process to each deferred operation, since we want to shift resources toward accomplishing the deferred ops if too many back up. WANG 90 maintains a queue of deleted operation on the DESTINATION block. This has the disadvantage that whenever a block is split or merged the deferred-operation queues have to be split or merged as well -- uggh!.] III. Buffer, I/O, and Free-List Management 1. When to reclaim buffers - age vs. level vs. dirtyness THIS IS A PERENNIAL NUISANCE -- THERE's a complex system in place which hasnt really been evaluated or tuned, and theres also a proposal by Jonathan to simply use the first free (or freeable) buffer one finds. 2. UPDATE-ACCESS: It seems to me there was some case where it wasnt OK to just do a release to #f and an update from there -- but i cant think of it offhand... 3. Deferred writes of data blocks Note: this is a different sort of deferral from the "deferred index updates" discussed earlier. Those deferred operations were correctness- presetrving; these are not. The idea here is that we can reduce i/o traffic if we can safely lose a "small" number of updates. 3.1 Description and purpose The general idea here is that we can reduce I/O traffic by deferring writes of leaf blocks, in the sense that the updates can be lost without compromising the structure of the database. This only works where the data updates in question are not critical to database or application. Currently, we always defer leaf updates -- both PUTs and REMoves -- to data trees, where a data tree is any tree not of type DIRECTORY. (DIRECTORY leaf blocks are written to disk after every update.) The idea is that the user application should have control over how often the data blocks are written. Also, deferral of PUTs and DELETES should be separately controllable. This feature is needed for example by the database itself in maintaining the free list: we can afford to defer INSERTS of deleted block, because the worst that could happen is that a free block gets lost. But DELETES from the free list must update immediately, else a block could be doubly-allocated. 3.2 Implementation The handle is being extended to contain the following boolean values: (a) SAP: save block after PUTs (b) SAR: save block after REMOVEs (c) SAC: force block save after cached block changes (d) FAC: flush buffer entirely after cached block changes (not currently implemented -- future functionality) These bits are used for example as follows: DIRECTORY: SAP=SAR=1; FREE LIST: SAR=1; SAP=SAC=0; USER DATA: SAP=SAR=SAC=0; The state of these bits can be changed at any time using (HAN-SET-WCB han new-bits). The user can insure that the current block has been written out by calling WRITE-CACHE-BLK, and can insure that all modified blocks have been written out by calling WRITE-MODIFIED-BLKS. At the moment, for safety, directory trees force HAN-SAP and HAN-SAR when opened or created. Also, OPEN-SEG forces the free list write control bits to be as shown above, regardless of the block type of the free-list. 4. Caching of last (leaf) block used Works great NOT IMPL YET: We could actually be cleverer and cache the parents of every node, and even the PREVs, using the same TRY-GET-ENT protocol! 5. Note. Multiple READ access is not supported yet. READ-READ conflicts can occur. We seem to recall noting that this limitation kept us from encountering some other problem, whose identity is lost for the moment. 6. Free-block management Outline: - free list is implemented as a B tree (its root block is in block 2 of the file) - with cache (free-list-cache, one per segment) - cache gets filled (to 1/2 full) when it empty and a block is needed; it gets flushed (to 1/2 full) when its about to overflow. - cache can be filled either from the free-list-tree or if that's empty, the file is extended. - cache filling is now done efficiently using scan-delete. This also means that only one disk write will be done per fill (in most cases). - cache filling is interlocked so only one process can be filling the cache at a time. - currently cache emptying is inefficient as each write of of a block to the tree causes an I/O [deferred writes is intended to fix this] Current problems: Last week (1/8) we concluded that (all efforts to date notwithstanding) there were still failure cases in free-block management under concurrent operation, because of problems like: - multiple processes can eat up however many blocks are in the cache, meaning we still haven't handled the case where a PUT to the free list causes a block split (necessitating a recursive call to FREE-BLOCK). - Conversely, it is possible for the cache to become overfilled by a cache filling operation, if in the meantime OTHER processes fill the cache with deleted blocks. - We need to be sure in general that concurrent fills and/or empties don't overfill or overempty the cache. On the positive side, we realized that we NEED NOT allow enough free blocks for a free-list insert to split the whole tree -- any split except the leaf split can simply be postponed if there isnt a free block available at that time! (And similarly for any insert-updates that happen to be triggered by free-list accesses.) Come to think of it, if we happen to run out of blocks during a free-list insert, its OK to let the insert fail, we just lose one disk block!! That may just be the answer!! 7. Buffer management routines Two routines have been built for purging buffers. FLUSH-BUFFER(ENT) writes out ENT if it is dirty and unlocked. It returns TERMINATED if ENT is locked, RETRYERR if the write is attempted and fails, and SUCCESS otherwise. PURGE-BUFFER(ENT) writes out ENT if it is dirty and then frees up the buffer. This IGNORES the acceess status of the buffer, so it sould not be called by users; it always sreturns SUCCESS. Use (DO-SEG-BUFFERS SEG# FUNC) to apply a function to all the buffers of a given segment; for example, (DO-SEG-BUFFERS SEG# FLUSH-BUFFER) can be used to guarantee that segment SEG#'s disk file is up to date. DO-SEG-BUFFERS halts if FUNC returns other than SUCCESS; the result of FUNC is returned. SUCCESS is returned if all buffers have been successfully processed. To process all segments, use SEG #=-1. (CHECK-BUFFER ENT) chekcs that the buffer is written and unlocked, and repairs those that are not. 8. Other issues to document: - bucket-locking method - treatment of access conflict conditions - amnesia-ent - caching criteria (eh?) IV. Error handling The problem: Calls need to return adequate information to distinguish: 1. restartable operations, for example - update-parent on insert - update-parent on delete - unlink for delete 2. non-restartable failures (eg, disk i/o error) 3. null results (eg, key not found) 4. "real values", eg, the length of a returned string, or an ENT pointer vs. NULL. The solution: use a bounded range of negative values as "failure" codes, leaving the non-negative return values available as "success" codes. The canonical success code is 0, but a routine that needs to return a value (string length, ENT pointer) can do so and have that value interpreted as a "success" code as well. There are "degrees" of failure,and the negative codes are partitioned according to increasingly severity, as follows: SUCCESS 0 ; successful execution NOTPRES -1 ; successful execution, no data present or no change made TERMINATED -2 ; failure, no damage, caller can retry operation RETRYERR -10 ; failure, no damage, caller can retry operation ARGERR -15 ; failure, no damage, call was in error NOROOM -20 ; failure, no damage, out of room in file TYPERR -30 ; failure, file or object was not of correct type IOERR -40 ; i/o error, DB may be damaged STRANGERR -45 ; internal error, DB may be damaged UNKERR -90 ; placeholder code MAXERR -100 The first class represent operations that completed without error. The second class represent operations that failed to complete, but are guarantedd to leve the DB in a correct state and are retryable (or easily correctible). The third class represent operations that failed to complete, did not damage the database, but are not easily fixable or restartable. The last class represent error conditions in which the DB was corrupted, or diuring which DB corruption was detected. The predicate (ERR? code) returns #t if the return code is within the range NOTPRES-MAXERR; the predicate (REALERR? code) returns #t if CODE is an actual error, as opposed to a "not there" or "stop processing" message. V. Misc 1. "Largest keys" (End-of-chain marker strings) Because we've reserved keys with a first byte of FF for split keys, these keys are unavailable to the user. 2. NULL keys WB supports use of the null string as a key (Sliced Bread treats the key specially). 3. Searching on minimum and maximum keys Given that the null string may be used as a key, there needs to be a way to specify a "least" key such that NEXT(least) yields the first key, even if it is NULL. The special key strings with length s of -2 and -1 are provided to represent the minimum and maximum key values, respectively. These keys are only useful with NEXT and PREV; data cannot be associated with these keys. 4. Need to document TSCAN, TSTATS.